HappyDB is a corpus of 100,000 crowd-sourced happy moments via Amazon’s Mechanical Turk. You can read more about it on https://arxiv.org/abs/1801.07746.

Here, we explore this data set and try to answer the question, “What makes people happy?”

Step 0 - Load all the required libraries

From the packages’ descriptions:

library(tidyverse)
library(tidytext)
library(DT)
library(scales)
library(wordcloud2)
library(gridExtra)
library(ngram)
library(shiny) 
library(wordcloud)

Step 1 - Load the processed text data along with demographic information on contributors

We use the processed data for our analysis and combine it with the demographic information available.

hm_data <- read_csv("../output/processed_moments.csv")

urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv'
demo_data <- read_csv(urlfile)

Combine both the data sets and keep the required columns for analysis

We select a subset of the data that satisfies specific row conditions.

hm_data <- hm_data %>%
  inner_join(demo_data, by = "wid") %>%
  select(wid,
         original_hm,
         gender, 
         marital, 
         parenthood,
         reflection_period,
         age, 
         country, 
         ground_truth_category, 
         text) %>%
  mutate(count = sapply(hm_data$text, wordcount)) %>%
  filter(gender %in% c("m", "f")) %>%
  filter(marital %in% c("single", "married")) %>%
  filter(parenthood %in% c("n", "y")) %>%
  filter(reflection_period %in% c("24h", "3m")) %>%
  mutate(reflection_period = fct_recode(reflection_period, 
                                        months_3 = "3m", hours_24 = "24h"))
datatable(hm_data)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

# Level I: Simple wordcloud analysis

As a graduate student who are in their early 20s, we are interested in what will make us happy when we are in our 30s. More specifically, we want to see what is the difference between our 20s and 30s when we further diversify the sample into gender and parenthood. Here are some sample questions that we wish to answer after finishing our analysis: 1. What makes us happy 20s vs. 30s? 2. What makes us happy 20s(parenthood) vs. 30s(parenthood)? 3. What makes us happy 20s (male) vs. 30s (male)?

hm_data$num_age <-as.numeric(hm_data$age)
## Warning: NAs introduced by coercion
hm_data<-hm_data[is.na(hm_data$num_age)!=T,]
age_between_20_30 <- hm_data[hm_data$num_age>=20&hm_data$num_age<30,]
age_between_30_40 <- hm_data[hm_data$num_age>=30&hm_data$num_age<40,]

Create a bag of words (1) using the text data

bag_of_words <-  hm_data %>%
  unnest_tokens(word, text)
word_count <- bag_of_words %>%
  count(word, sort = TRUE)

word_age_between_20_30 <-age_between_20_30 %>%
  unnest_tokens(word, text)
word_count_age_between_20_30 <- word_age_between_20_30 %>%
  count(word,sort = TRUE)

word_age_between_30_40 <-age_between_30_40 %>%
  unnest_tokens(word, text)
word_count_age_between_30_40 <- word_age_between_30_40 %>%
  count(word,sort = TRUE)
wordcloud(words=word_count_age_between_20_30$word,freq=word_count_age_between_20_30$n,
           min.freq=10, max.words= 100, random.order = F, rot.per = .35, colors = brewer.pal(8,"Dark2"))

wordcloud(words=word_count_age_between_30_40$word,freq=word_count_age_between_30_40$n,
           min.freq=10, max.words= 100, random.order = F, rot.per = .35, colors = brewer.pal(8,"Dark2"))

As we can see from the wordcloud that for people in their 20s, friend/day/time are most frequent things showing up in the survey. And the other more frequent things are game/birthday/enjoyed/fun. While as for people in their 30s, once again, day/friend/time are most frequent. However, the other more frequent things are a bit different whic are home/family/son/wife/feel/fun…From the output we can infer when people aging from 20 to 30ish, their concentrations of life shift to family yet things make them happy in their 20s are still playing an important role in their lifef making them happy.

age_between_20_30_parenthood <- age_between_20_30[age_between_20_30$parenthood=="y",]
age_between_30_40_parenthood <- age_between_30_40[age_between_30_40$parenthood=="y",]

Create a bag of words using the text data

bag_of_words <-  hm_data %>%
  unnest_tokens(word, text)
word_count <- bag_of_words %>%
  count(word, sort = TRUE)

word_age_between_20_30_parenthood <-age_between_20_30_parenthood %>%
  unnest_tokens(word, text)
word_count_age_between_20_30_parenthood <- word_age_between_20_30_parenthood %>%
  count(word,sort = TRUE)

word_age_between_30_40_parenthood <-age_between_30_40_parenthood %>%
  unnest_tokens(word, text)
word_count_age_between_30_40_parenthood <- word_age_between_30_40_parenthood %>%
  count(word,sort = TRUE)
wordcloud(words=word_age_between_20_30_parenthood$word,freq=word_count_age_between_20_30_parenthood$n,
           min.freq=10, max.words= 100, random.order = F, rot.per = .35, colors = brewer.pal(8,"Dark2"))

wordcloud(words=word_count_age_between_30_40_parenthood$word,freq=word_count_age_between_30_40_parenthood$n,
           min.freq=10, max.words= 100, random.order = F, rot.per = .35, colors = brewer.pal(8,"Dark2"))

As we can see for people in their 20s with parenthood, what makes them happy is something like son/movie/fun/run/time which is not too different from those without parenthood. Maybe is because they are new parents? Or maybe they are just beginning learning how to be a great parent?

In contrast, for people in their 30s with parenthood, they are more mature folks. What makes them happy is somehing like Son/family/husband/wife/life/kids. They are more family oriented. It makes sense because after years training/practice to be great parent, being family oriented is an important factor that should be attributed to.

In addition, I have to point out that there might be some bias in the sample which can be explained properly. For example, anger makes people happy? This does not make any sense generally.

age_between_20_30_male <- age_between_20_30[age_between_20_30$gender=="m",]
age_between_30_40_male <- age_between_30_40[age_between_30_40$gender=="m",]

Create a bag of words using the text data

bag_of_words <-  hm_data %>%
  unnest_tokens(word, text)
word_count <- bag_of_words %>%
  count(word, sort = TRUE)

word_age_between_20_30_male <-age_between_20_30_male %>%
  unnest_tokens(word, text)
word_count_age_between_20_30_male <- word_age_between_20_30_male %>%
  count(word,sort = TRUE)

word_age_between_30_40_male <-age_between_30_40_male %>%
  unnest_tokens(word, text)
word_count_age_between_30_40_male <- word_age_between_30_40_male %>%
  count(word,sort = TRUE)
wordcloud(words=word_age_between_20_30_male$word,freq=word_count_age_between_20_30_male$n,
           min.freq=10, max.words= 100, random.order = F, rot.per = .35, colors = brewer.pal(8,"Dark2"))

wordcloud(words=word_count_age_between_30_40_male$word,freq=word_count_age_between_30_40_male$n,
           min.freq=10, max.words= 100, random.order = F, rot.per = .35, colors = brewer.pal(8,"Dark2"))

For people like me who are in their 20s and are male, what makes them happy is something like examination/birthday/night/girl:) This is not a surprise for me. However,what surprises me is that son is the most frequent word here. As a result, I certainly doubt the sample or the result since this is a bit weird for a society that is considered having gender equality.

Well, things will change when we (male in 20s) turn to our 30s. We will become more reliable gentlemen. What makes us happy is friend/family/home…I look forward getting aged to some extend based on the output.

Level II: Sentiment Analysis

age_between_20_30%>%
  unnest_tokens(word, text)%>%
  select(word)%>%
  inner_join(get_sentiments("nrc"))%>%
  count(word,sentiment, sort=T)%>%
  ungroup()%>%
  group_by(sentiment)%>%
  top_n(10)%>%
  ungroup()%>%
  mutate(word=reorder(word, n))%>%
  ggplot(aes(word, n, fill=sentiment))+
  geom_col(show.legend = F)+
  facet_wrap(~sentiment,scales = "free_y")+
  labs(y="conttribution", 
       x=NULL)+
  coord_flip()
## Joining, by = "word"
## Selecting by n

age_between_30_40%>%
  unnest_tokens(word, text)%>%
  select(word)%>%
  inner_join(get_sentiments("nrc"))%>%
  count(word,sentiment, sort=T)%>%
  ungroup()%>%
  group_by(sentiment)%>%
  top_n(10)%>%
  ungroup()%>%
  mutate(word=reorder(word, n))%>%
  ggplot(aes(word, n, fill=sentiment))+
  geom_col(show.legend = F)+
  facet_wrap(~sentiment,scales = "free_y")+
  labs(y="conttribution", 
       x=NULL)+
  coord_flip()
## Joining, by = "word"
## Selecting by n

from the sentiment analysis we can see that people who are older (30s) have less fear emotion and more joy emotion than people who are in their 20s. It might be attributed to as people getting more sophisticated they can better distinguish joy between fear. However, we notice that for the same word “money”, elder people tend to be anger on it. Maybe it is because people realize money is evil? But if we spell back evil, it is live…That’s a bonus finding:)